Statistical Outlier Detection in Large Multivariate Datasets
Abstract
This work focuses on detecting outliers in large and very large datasets using a computationally efficient procedure. The algorithm applies Tukey’s biweight function to the dataset to filter out the effects of extreme values and obtain appropriate location and scale estimates. Robust Mahalanobis distances for all data points are then calculated from these estimates. A suitable rejection point for outliers is determined from a non-parametric Parzen-window estimate of the probability density of the robust Mahalanobis distances: the separation boundary is placed where the density curve descends and then ascends again for outlying distances. The procedure identifies outliers well even when the data are highly skewed and overlapping, in contrast to established statistical outlier detection methods for univariate and multivariate data, which require the underlying distribution to be known.
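The three steps described in the abstract (biweight-based location and scale, robust Mahalanobis distances, and a Parzen-window valley as the rejection point) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the tuning constant `c`, the median scaling of distances, the Silverman-style bandwidth, and the `min_frac` depth requirement for the valley are all assumptions made here for illustration, as is the synthetic test data.

```python
import numpy as np

def mahalanobis(X, mu, cov):
    """Mahalanobis distance of each row of X from mu under scatter matrix cov."""
    diff = X - mu
    return np.sqrt(np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff))

def biweight_location_scatter(X, c=3.0, n_iter=20):
    """Reweighted mean and scatter using Tukey biweight weights.

    Assumption: distances are scaled by c times their median before
    weighting; the abstract does not state the tuning that was used.
    """
    mu = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - mu), axis=0)
    cov = np.diag(mad ** 2)  # robust diagonal starting point
    for _ in range(n_iter):
        d = mahalanobis(X, mu, cov)
        u = d / (c * np.median(d))
        # Biweight: smooth down-weighting, exactly zero beyond the cutoff.
        w = np.where(u < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu
        cov = (w[:, None] * diff).T @ diff / w.sum()
    return mu, cov

def valley_threshold(d, min_frac=0.1, n_grid=512):
    """First point past the main mode where the Parzen density rises again.

    Assumptions: Gaussian kernel, Silverman-style bandwidth, and a valley
    only counts once the density is below min_frac of the peak height.
    Returns np.inf when no such valley exists (nothing is rejected).
    """
    q75, q25 = np.percentile(d, [75, 25])
    h = 0.9 * min(d.std(), (q75 - q25) / 1.34) * len(d) ** (-0.2)
    grid = np.linspace(0.0, d.max() + 3.0 * h, n_grid)
    z = (grid[:, None] - d[None, :]) / h
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(d) * h * np.sqrt(2.0 * np.pi))
    peak = int(np.argmax(density))
    idx = np.arange(n_grid - 1)
    rising = np.diff(density) > 0                   # curve ascends after grid[i]
    low = density[:-1] < min_frac * density[peak]   # already well below the mode
    candidates = np.where(rising & low & (idx > peak))[0]
    return grid[candidates[0]] if candidates.size else np.inf

# Synthetic example (hypothetical data): 500 inliers plus 25 clustered outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(500, 2))
outliers = rng.normal(loc=8.0, scale=0.3, size=(25, 2))
X = np.vstack([inliers, outliers])

mu, cov = biweight_location_scatter(X)
d = mahalanobis(X, mu, cov)
threshold = valley_threshold(d)
flagged = d > threshold
print("rejection point:", threshold, "flagged points:", int(flagged.sum()))
```

Because the rejection point is read off the estimated density of the distances rather than a chi-squared quantile, the sketch does not assume a known underlying distribution, which is the property the abstract emphasizes. The dense kernel matrix used here costs memory proportional to the number of points times the grid size, so a very large dataset would need a binned or subsampled density estimate instead.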
Similar resources
Detecting outliers in high-dimensional neuroimaging datasets with robust covariance estimators
Medical imaging datasets often contain deviant observations, the so-called outliers, due to acquisition or preprocessing artifacts or resulting from large intrinsic inter-subject variability. These can undermine the statistical procedures used in group studies as the latter assume that the cohorts are composed of homogeneous samples with anatomical or functional features clustered around a cent...
An Empirical Comparison of Outlier Detection Methods
Four outlier detection methods are compared using both publicly available smaller statistical datasets and real-life Knowledge Discovery in Databases (KDD) datasets [1]. The smaller datasets provide insight (via visualisations) into the relative strengths and weaknesses of the compared methods. The real-life large datasets test scalability and practicality of application. We are unaware of prev...
Identification of outliers types in multivariate time series using genetic algorithm
Multivariate time series data are often modeled using a vector autoregressive moving average (VARMA) model. However, the presence of outliers can violate the stationarity assumption and may lead to incorrect modeling, biased parameter estimates, and inaccurate prediction. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...
Z-Glyph: Visualizing outliers in multivariate data
Outlier analysis techniques are extensively used in many domains such as intrusion detection. Today, even with the most advanced statistical learning techniques, human judgment still plays an important role in outlier analysis tasks due to the difficulty of defining and collecting outlier examples. This work seeks to tackle this problem by introducing a new visualization design, “Z-Glyph,” a ...
Multivariate Outlier Detection Using Independent Component Analysis
Recent developments consider a rather unexpected application of the theory of independent component analysis (ICA), found in outlier detection, data clustering, multivariate data visualization, etc. Accurate identification of outliers plays an important role in statistical analysis. If classical statistical models are blindly applied to data containing outliers, the results can be ...